setwd('~/')
setwd('~/Documents/UdacityDAND/EDAFinalProject')
ww <- read.csv('wineQualityWhites.csv')
# install.packages('psych')
library(GGally)
library(psych)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
library(gridExtra)
library(memisc)
## Loading required package: lattice
## Loading required package: MASS
## 
## Attaching package: 'memisc'
## The following objects are masked from 'package:stats':
## 
##     contrasts, contr.sum, contr.treatment
## The following object is masked from 'package:base':
## 
##     as.array
head(ww)
##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.0             0.27        0.36           20.7     0.045
## 2 2           6.3             0.30        0.34            1.6     0.049
## 3 3           8.1             0.28        0.40            6.9     0.050
## 4 4           7.2             0.23        0.32            8.5     0.058
## 5 5           7.2             0.23        0.32            8.5     0.058
## 6 6           8.1             0.28        0.40            6.9     0.050
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  45                  170  1.0010 3.00      0.45     8.8
## 2                  14                  132  0.9940 3.30      0.49     9.5
## 3                  30                   97  0.9951 3.26      0.44    10.1
## 4                  47                  186  0.9956 3.19      0.40     9.9
## 5                  47                  186  0.9956 3.19      0.40     9.9
## 6                  30                   97  0.9951 3.26      0.44    10.1
##   quality
## 1       6
## 2       6
## 3       6
## 4       6
## 5       6
## 6       6

Introduction

The aim is to examine the data set and the variables it contains.

This report explores a data set of 4,898 white wines with 13 columns: a row index (X), 11 variables quantifying the chemical properties of each wine, and a quality score.

Univariate Plots

nrow(ww)
## [1] 4898
ncol(ww)
## [1] 13
str(ww)
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
summary(ww)
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000
qplot(x = fixed.acidity, data = ww, binwidth = .1) +
  scale_x_continuous(limits = c(4, 10), breaks = seq(4, 10, .5))
## Warning: Removed 9 rows containing non-finite values (stat_bin).

The distribution is approximately normal and unimodal, with fixed acidity peaking around 6.8. A few outliers below 4 and above 10 have been excluded from the plot. According to Waterhouse, most wines have a tartaric acid value between 1 g/dm^3 and 4 g/dm^3. Is there a strong correlation between fixed acidity and pH? Now let’s explore what the plots look like for the other variables.

qplot(x = volatile.acidity, data = ww, binwidth = .01) +
  scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, .1))
## Warning: Removed 2 rows containing non-finite values (stat_bin).

This distribution is also unimodal, peaking around a volatile acidity of 0.28. Waterhouse claims that the average acetic acid value is less than 400 mg/L, which is consistent with our dataset. The legal limit of acetic acid in the US for white wine is 1.1 g/dm^3. Too much acetic acid can result in unpleasant aromas. In addition to undesirable aromas, both acetic acid and acetaldehyde are toxic to Saccharomyces cerevisiae and may lead to stuck fermentations.

qplot(x = citric.acid, data = ww, binwidth = .01) + 
  scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, .1))
## Warning: Removed 2 rows containing non-finite values (stat_bin).

This distribution is also approximately normal, with citric acid peaking around 0.3. Why is there a sudden peak at around 0.49?

According to Waterhouse, one would expect to see 0 to 500 mg/L of citric acid. This might be why the value peaks at around 0.49-0.5.

qplot(x = residual.sugar, data = ww, binwidth = .1) +
  scale_x_continuous(limits= c(0, 25), breaks = seq(0, 25, 2))
## Warning: Removed 5 rows containing non-finite values (stat_bin).

I observe a long-tailed distribution. A few extreme outliers above 25 g/L (the maximum is 65.8) are excluded from the graph. According to winefolly.com:

- < 1 g/L (g/dm^3): Bone Dry
- 1 to 10 g/L: Dry
- 10 to 35 g/L: Off-Dry
- 35 to 120 g/L: Sweet Wine
- 120 to 220 g/L: Very Sweet Wine

We can conclude that most of the wines in the data set are dry wines.

A wine is dry when the yeast has consumed all of the available sugar, producing ethanol as a by-product. This is why some sweet wines have less alcohol than their dry counterparts. We can look at the correlation between residual sugar content and alcohol. Is this an inverse relationship?

qplot(x = residual.sugar, data = ww, bins = 30) +
  scale_x_log10()

The transformed distribution is bimodal and peaks at two places. First around 4 and then around 9. What do these peaks represent?
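
A log-scaled histogram needs only one x scale, and an explicit bin width makes the shape of the two peaks easier to judge. The following is a minimal sketch, assuming ggplot2 is installed. The data frame `sugar_demo` and its simulated values are hypothetical stand-ins for `ww$residual.sugar`, and the bin width of 0.05 is an arbitrary choice.

```r
library(ggplot2)

# Hypothetical stand-in for ww$residual.sugar: a mix of low-sugar and
# higher-sugar wines, simulated only so the sketch runs on its own.
set.seed(1)
sugar_demo <- data.frame(
  residual.sugar = c(rlnorm(300, meanlog = log(1.5), sdlog = 0.3),
                     rlnorm(300, meanlog = log(9),   sdlog = 0.4))
)

# One x scale only. On a log10 axis the binwidth is measured in log10 units,
# because the scale transformation is applied before the binning.
p_log <- ggplot(sugar_demo, aes(x = residual.sugar)) +
  geom_histogram(binwidth = 0.05) +
  scale_x_log10(breaks = c(1, 2, 5, 10, 20))

print(p_log)
```

With the real data, `sugar_demo` would be replaced by `ww`; the bin width would likely need some trial and error.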

qplot(x = chlorides, data = ww, bins = 30) +
  scale_x_continuous(limits = c(0, 0.1), breaks = seq(0, .1, .01))
## Warning: Removed 110 rows containing non-finite values (stat_bin).

The majority of the values lie between 0 and 0.1. The distribution is approximately normal, with a peak at around 0.04. Most wines have a salt content of less than 0.1.

qplot(x = free.sulfur.dioxide, data = ww, bins = 30) +
  scale_x_continuous(limits = c(0, 150), breaks = seq(0, 150, 10))
## Warning: Removed 1 rows containing non-finite values (stat_bin).

Free sulfur dioxide appears approximately normally distributed, with its peak at approximately 30. Most wines have a free sulfur dioxide content of less than 100.

qplot(x = total.sulfur.dioxide, data = ww, bins = 30) +
  scale_x_continuous(limits = c(0, 320), breaks = seq(0, 320, 20))
## Warning: Removed 3 rows containing non-finite values (stat_bin).

Total sulfur dioxide is approximately normally distributed, with a peak in the 120s. Sulfites are used to preserve wines. Most people can easily digest sulfites, but some people have extreme allergic reactions to them. According to Waterhouse, the average sulfite content in wine is around 80 mg/L, which is lower than the mean of about 138 in this dataset. SO2 content above 50 is detectable in the nose and taste of wine. Given this, there are many wines in the dataset where SO2 might become evident in the nose and taste.

qplot(x = density, data = ww, binwidth = .001) +
  scale_x_continuous(limits = c(.985, 1.015), breaks = seq(.985, 1.015, .005))
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).

Density seems to follow a normal distribution with peak at nearly 0.992. There are a few outliers as well.

qplot(x = pH, data = ww, binwidth = .05) +
  scale_x_continuous(limits = c(2.7, 4), breaks = seq(2.7, 4, .1))

pH seems to follow a normal distribution with a peak at nearly 3.15. According to Dr. Vinny’s post on winespectator.com, the ideal pH value for white wines is around 3.0-3.4.

qplot(x = sulphates, data = ww, binwidth = .01) +
  scale_x_continuous(limits = c(0.2, 1.1), breaks = seq(0.2, 1.1, .05))

The distribution is approximately normal, with a peak at 0.5. Potassium sulphate is an additive that can contribute to sulfur dioxide gas, which acts as an antimicrobial and antioxidant.

qplot(x = alcohol, data = ww, binwidth = .1) +
  scale_x_continuous(limits = c(8, 14.5), breaks = seq(8, 14.5, .5))

The white wines have an alcohol content between roughly 8% and 14%, with the highest concentration between 9% and 10.5%.

qplot(x = quality, data = ww, geom = 'bar')

The most common quality score is 6. These values might be biased in many ways, as quality is sensory data and completely subjective. The scores might differ if a different set of experts were used.
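
The share of wines at each score can be worked out directly from the counts. The sketch below copies the `n` column of the `winesByQuality` summary table printed later in this report; the variable names `quality_counts`, `quality_share` and `mid_share` are introduced here only for illustration. Roughly 45% of the wines score 6, and about three quarters score either 5 or 6.

```r
# Counts per quality score, copied from the `n` column of the
# winesByQuality summary printed later in this report.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

# Share of wines at each score, and the combined share of scores 5 and 6.
quality_share <- quality_counts / sum(quality_counts)
mid_share <- sum(quality_share[c("5", "6")])

print(round(100 * quality_share, 1))
print(round(100 * mid_share, 1))
```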

Let’s look at all variable values by quality:

qplot(x = fixed.acidity, data = ww, binwidth = .1) +
  scale_x_continuous(limits = c(4, 10), breaks = seq(4, 10, .5)) +
  facet_wrap(~ quality, nrow =5)
## Warning: Removed 9 rows containing non-finite values (stat_bin).

The fixed acidity (tartaric acid) for wines of each quality level peaks between 6 and 8 g/L.

qplot(x = volatile.acidity, data = ww, binwidth = .01) +
  scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, .1)) +
  facet_wrap(~ quality)
## Warning: Removed 2 rows containing non-finite values (stat_bin).

qplot(x = citric.acid, data = ww, binwidth = .01) + 
  scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, .1)) +
  facet_wrap(~ quality)
## Warning: Removed 2 rows containing non-finite values (stat_bin).

qplot(x = residual.sugar, data = ww, binwidth = .1) +
  scale_x_continuous(limits= c(0, 25), breaks = seq(0, 25, 2)) +
  facet_wrap(~ quality)
## Warning: Removed 5 rows containing non-finite values (stat_bin).

qplot(x = residual.sugar, data = ww, bins = 30) +
  scale_x_log10() +
  facet_wrap(~quality)

qplot(x = chlorides, data = ww, bins = 30) +
  scale_x_continuous(limits = c(0, 0.1), breaks = seq(0, .1, .01)) +
  facet_wrap(~ quality)
## Warning: Removed 110 rows containing non-finite values (stat_bin).

qplot(x = free.sulfur.dioxide, data = ww, bins = 30) +
  scale_x_continuous(limits = c(0, 150), breaks = seq(0, 150, 10)) +
  facet_wrap(~ quality)
## Warning: Removed 1 rows containing non-finite values (stat_bin).

qplot(x = total.sulfur.dioxide, data = ww, bins = 30) +
  scale_x_continuous(limits = c(0, 320), breaks = seq(0, 320, 20)) +
  facet_wrap(~ quality)
## Warning: Removed 3 rows containing non-finite values (stat_bin).

qplot(x = density, data = ww, binwidth = .001) +
  scale_x_continuous(limits = c(.985, 1.015), breaks = seq(.985, 1.015, .005)) +
  facet_wrap(~ quality)
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 7 rows containing missing values (geom_bar).

qplot(x = pH, data = ww, binwidth = .05) +
  scale_x_continuous(limits = c(2.7, 4), breaks = seq(2.7, 4, .1)) +
  facet_wrap(~ quality)

qplot(x = sulphates, data = ww, binwidth = .01) +
  scale_x_continuous(limits = c(0.2, 1.1), breaks = seq(0.2, 1.1, .05)) +
  facet_wrap(~ quality)

qplot(x = alcohol, data = ww, binwidth = .1) +
  scale_x_continuous(limits = c(8, 14.5), breaks = seq(8, 14.5, .5)) +
  facet_wrap(~ quality)

# Classify each wine by residual sugar (g/L); each test only runs when the
# previous one failed, so the lower bounds are implied.
ww$dryness <- ifelse(ww$residual.sugar < 1, "Bone Dry",
              ifelse(ww$residual.sugar < 10, "Dry",
              ifelse(ww$residual.sugar < 35, "Off Dry",
              ifelse(ww$residual.sugar < 120, "Sweet", "Very Sweet"))))
qplot(x = dryness, data = ww, geom = 'bar')
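
The same binning can be written with base R's `cut()`, which may be easier to read than nested `ifelse()` calls. This is a sketch of an alternative, not the code used for the plot above. The vector `sugar` holds hypothetical values chosen to fall in every category, including the boundaries.

```r
# Hypothetical residual sugar values (g/L) chosen to hit every category.
sugar <- c(0.6, 1, 5.2, 10, 20.7, 35, 65.8, 120)

# right = FALSE makes each interval closed on the left, e.g. [1, 10),
# which matches the >= / < comparisons in the nested ifelse() version.
dryness <- cut(sugar,
               breaks = c(-Inf, 1, 10, 35, 120, Inf),
               labels = c("Bone Dry", "Dry", "Off Dry", "Sweet", "Very Sweet"),
               right = FALSE)

print(data.frame(sugar, dryness))
```

Unlike `ifelse()`, `cut()` returns a factor whose levels follow the order of the labels, so a bar chart would list the categories from driest to sweetest rather than alphabetically.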

Univariate Analysis

What is the structure of your dataset?

The data set consists of 4,898 variants of the Portuguese White Wine “Vinho Verde”, with measurements of eleven chemical properties:

- Fixed Acidity: acid that contributes to the conservation of wine.
- Volatile Acidity: amount of acetic acid in wine, which at high levels can lead to an unpleasant taste of vinegar.
- Citric Acid: found in small amounts, can add “freshness” and flavor to wines.
- Residual Sugar: amount of sugar remaining after the end of the fermentation.
- Chlorides: amount of salt in wine.
- Free Sulfur Dioxide: prevents the growth of microbes and the oxidation of the wine.
- Total Sulfur Dioxide: total amount of SO2 (free and bound); at higher levels it becomes evident in the aroma and taste of the wine.
- Density: density of the wine, which depends on the percentage of alcohol and the amount of sugar.
- pH: describes how acidic or basic a wine is on a scale of 0 to 14.
- Sulphates: additive that acts as an antimicrobial and antioxidant.
- Alcohol: percentage of alcohol present in the wine.

And a sensory property:

- Quality: grade between 0 and 10 given by specialists.

Observations:

- Most wines have medium quality (5 and 6).
- There’s no evident predictor of quality from the univariate analysis.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in the data set is quality, which is also our dependent variable. I’d like to determine which features are best for predicting the quality of wine. I suspect some combination of the chemical property variables can be used to build a predictive model of the quality of white wines.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

It is very difficult to predict quality from the given variables at first glance. I did not notice any significant relationship even after faceting the variables by quality. Perhaps I could start by investigating how residual sugar relates to the other properties.

Did you create any new variables from existing variables in the dataset?

I created a new variable called dryness, which is based on the residual sugar content as follows:

- < 1 g/L (g/dm^3): Bone Dry
- 1 to 10 g/L: Dry
- 10 to 35 g/L: Off-Dry
- 35 to 120 g/L: Sweet Wine
- 120 to 220 g/L: Very Sweet Wine

Most of the wines are Dry in nature.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

It was necessary to exclude anomalies and extreme values in some cases for better visualisations. Some properties, such as residual sugar and density, had extreme values. In addition, the residual sugar of the white wines showed a long-tailed distribution. I applied a log10 transformation and obtained a bimodal distribution.
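
Setting `limits` on a scale discards the rows outside the range before any statistic is computed, which is what the "Removed N rows" warnings report. If the goal is only to zoom in, `coord_cartesian()` keeps every row. The sketch below illustrates the difference under stated assumptions: `demo` holds a handful of hypothetical values, not the wine data, and `n_dropped` and `p_zoom` are names introduced for this example.

```r
library(ggplot2)

# Hypothetical values standing in for ww$residual.sugar.
demo <- data.frame(residual.sugar = c(0.6, 1.6, 5.2, 8.5, 20.7, 31.6, 65.8))
limits <- c(0, 25)

# Rows that scale_x_continuous(limits = limits) would turn into NA
# and drop before binning.
n_dropped <- sum(demo$residual.sugar < limits[1] |
                 demo$residual.sugar > limits[2])
print(n_dropped)

# coord_cartesian() only zooms the view; no rows are removed from the data.
p_zoom <- ggplot(demo, aes(x = residual.sugar)) +
  geom_histogram(binwidth = 1) +
  coord_cartesian(xlim = limits)
```

For histograms the two approaches draw much the same picture, but for smoothers and boxplots the dropped rows change the fitted line or the quartiles, so the choice matters more there.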

Bivariate Plots Section

ggpairs(ww, lower = list(continuous = wrap("points", shape = I('.'))), upper = list(combo = wrap("box", outlier.shape = I('.'))))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

pairs.panels(ww[,-5], 
             method = "pearson", # correlation method
             hist.col = "#00AFBB",
             density = TRUE,  # show density plots
             ellipses = TRUE # show correlation ellipses
             )

ggplot(aes(x = alcohol, y = quality), data = ww) +
  geom_point(alpha = 1/5) +
  geom_smooth()
## `geom_smooth()` using method = 'gam'

ggplot(aes(x = alcohol, y = quality), data = ww) +
  geom_jitter(alpha = 1/5)

ggplot(aes(x = fixed.acidity, y = quality), data = ww) +
  geom_point(alpha = 1/5) +
  scale_x_continuous(limits = c(4, 10), breaks = seq(4, 10, .5)) + geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 9 rows containing non-finite values (stat_smooth).
## Warning: Removed 9 rows containing missing values (geom_point).

ggplot(aes(x = fixed.acidity, y = quality), data = ww) +
  geom_jitter(alpha = 1/5) +
  scale_x_continuous(limits = c(4, 10), breaks = seq(4, 10, .5))
## Warning: Removed 10 rows containing missing values (geom_point).

ggplot(aes(x = volatile.acidity, y = quality), data = ww) +
  geom_point(alpha = 1/5) +
  geom_smooth() +
  scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, .1))
## `geom_smooth()` using method = 'gam'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).

ggplot(aes(x = volatile.acidity, y = quality), data = ww) +
  geom_jitter(alpha = 1/5) +
  scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, .1))
## Warning: Removed 2 rows containing missing values (geom_point).

ggplot(aes(x = citric.acid, y = quality), data = ww) +
  geom_point(alpha = 1/5) +
  geom_smooth() +
  scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, .1))
## `geom_smooth()` using method = 'gam'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).

ggplot(aes(x = citric.acid, y = quality), data = ww) +
  geom_jitter(alpha = 1/5) +
  scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, .1))
## Warning: Removed 14 rows containing missing values (geom_point).

ggplot(aes(x = residual.sugar, y = quality), data = ww) +
  geom_point(alpha = 1/5) +
  geom_smooth() +
  scale_x_continuous(limits= c(0, 25), breaks = seq(0, 25, 2))
## `geom_smooth()` using method = 'gam'
## Warning: Removed 5 rows containing non-finite values (stat_smooth).
## Warning: Removed 5 rows containing missing values (geom_point).

ggplot(aes(x = residual.sugar, y = quality), data = ww) +
  geom_jitter(alpha = 1/5) +
  scale_x_continuous(limits= c(0, 25), breaks = seq(0, 25, 2))
## Warning: Removed 5 rows containing missing values (geom_point).

ggplot(aes(x = chlorides, y = quality), data = ww) +
  geom_point(alpha = 1/5) +
  geom_smooth() +
  scale_x_continuous(limits = c(0, 0.1), breaks = seq(0, .1, .01))
## `geom_smooth()` using method = 'gam'
## Warning: Removed 110 rows containing non-finite values (stat_smooth).
## Warning: Removed 110 rows containing missing values (geom_point).

ggplot(aes(x = chlorides, y = quality), data = ww) +
  geom_jitter(alpha = 1/5) +
  scale_x_continuous(limits = c(0, 0.1), breaks = seq(0, .1, .01))

ggplot(aes(x = free.sulfur.dioxide, y = quality), data = ww) +
  geom_point(alpha = 1/5) +
  geom_smooth() +
  scale_x_continuous(limits = c(0, quantile(ww$free.sulfur.dioxide, .99)), breaks = seq(0, 150, 10))
## `geom_smooth()` using method = 'gam'
## Warning: Removed 43 rows containing non-finite values (stat_smooth).
## Warning: Removed 43 rows containing missing values (geom_point).

ggplot(aes(x = free.sulfur.dioxide, y = quality), data = ww) +
  geom_jitter(alpha = 1/5) +
  scale_x_continuous(limits = c(0, quantile(ww$free.sulfur.dioxide, .99)), breaks = seq(0, 150, 10))
## Warning: Removed 47 rows containing missing values (geom_point).

ggplot(aes(x = total.sulfur.dioxide, y = quality), data = ww) +
  geom_point(alpha = 1/5) +
  geom_smooth() +
  scale_x_continuous(limits = c(0, quantile(ww$total.sulfur.dioxide, .99)), breaks = seq(0, 250, 20))
## `geom_smooth()` using method = 'gam'
## Warning: Removed 49 rows containing non-finite values (stat_smooth).
## Warning: Removed 49 rows containing missing values (geom_point).

ggplot(aes(x = total.sulfur.dioxide, y = quality), data = ww) +
  geom_jitter(alpha = 1/5) +
  scale_x_continuous(limits = c(0, quantile(ww$total.sulfur.dioxide, .99)), breaks = seq(0, 250, 20))
## Warning: Removed 51 rows containing missing values (geom_point).

ggplot(aes(x = density, y = quality), data = ww) +
  geom_point(alpha = 1/5) +
  geom_smooth() +
  scale_x_continuous(limits = c(.985, quantile(ww$density, .99)), breaks = seq(.985, 1.015, .005))
## `geom_smooth()` using method = 'gam'
## Warning: Removed 49 rows containing non-finite values (stat_smooth).
## Warning: Removed 49 rows containing missing values (geom_point).

ggplot(aes(x = density, y = quality), data = ww) +
  geom_jitter(alpha = 1/5) +
  scale_x_continuous(limits = c(.985, quantile(ww$density, .99)), breaks = seq(.985, 1.015, .005))
## Warning: Removed 49 rows containing missing values (geom_point).

ggplot(aes(x = pH, y = quality), data = ww) +
  geom_point(alpha = 1/5) +
  geom_smooth() +
  scale_x_continuous(limits = c(2.7, quantile(ww$pH, .99)))
## `geom_smooth()` using method = 'gam'
## Warning: Removed 43 rows containing non-finite values (stat_smooth).
## Warning: Removed 43 rows containing missing values (geom_point).

ggplot(aes(x = pH, y = quality), data = ww) +
  geom_jitter(alpha = 1/5) +
  scale_x_continuous(limits = c(2.7, quantile(ww$pH, .99)))
## Warning: Removed 46 rows containing missing values (geom_point).

ggplot(aes(x = sulphates, y = quality), data = ww) +
  geom_point(alpha = 1/5) +
  geom_smooth() +
  scale_x_continuous(limits = c(0.2, quantile(ww$sulphates, .99)))
## `geom_smooth()` using method = 'gam'
## Warning: Removed 48 rows containing non-finite values (stat_smooth).
## Warning: Removed 48 rows containing missing values (geom_point).

ggplot(aes(x = sulphates, y = quality), data = ww) +
  geom_jitter(alpha = 1/5) +
  scale_x_continuous(limits = c(0.2, quantile(ww$sulphates, .99)))
## Warning: Removed 50 rows containing missing values (geom_point).

ggplot(aes(x = alcohol, y = quality), data = ww) +
  geom_point(alpha = 1/5) +
  geom_smooth() +
  scale_x_continuous(limits = c(8, quantile(ww$alcohol, .99)))
## `geom_smooth()` using method = 'gam'
## Warning: Removed 41 rows containing non-finite values (stat_smooth).
## Warning: Removed 41 rows containing missing values (geom_point).

ggplot(aes(x = alcohol, y = quality), data = ww) +
  geom_jitter(alpha = 1/5) +
  scale_x_continuous(limits = c(8, quantile(ww$alcohol, .99)))
## Warning: Removed 50 rows containing missing values (geom_point).

ww$total.acidity <- ww$fixed.acidity + ww$volatile.acidity

ggplot(aes(x = total.acidity, y = quality), data = ww) +
  geom_point() +
  geom_smooth() +
  scale_x_continuous(limits = c(4, quantile(ww$total.acidity, .99)))
## `geom_smooth()` using method = 'gam'
## Warning: Removed 49 rows containing non-finite values (stat_smooth).
## Warning: Removed 49 rows containing missing values (geom_point).

According to Waterhouse, the total acidity is the sum of fixed and volatile acidity.

ggplot(aes(x = total.acidity, y = quality), data = ww) +
  geom_jitter(alpha = 1/5) +
  scale_x_continuous(limits = c(4, quantile(ww$total.acidity, .99)))
## Warning: Removed 49 rows containing missing values (geom_point).

It is clear from the plots above that alcohol has the strongest correlation with quality. Here are the noteworthy correlations involving quality. I had to use the integer version of the quality variable in order to calculate the correlations.

- Quality and alcohol: 0.436
- Quality and density: -0.307

However, neither of these correlations can be considered strong.
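
Correlations like these can be computed for every predictor in one pass with `cor()`. The sketch below runs on a simulated miniature data set, so its numbers are not the ones quoted above; `demo`, `predictors`, `pearson` and `spearman` are names introduced for illustration. Since quality is an ordinal score, a rank-based Spearman correlation is shown next to Pearson as a possible cross-check.

```r
# Hypothetical miniature data set with the same column names as ww.
set.seed(42)
n <- 200
demo <- data.frame(alcohol = runif(n, 8, 14))
demo$density <- 1.005 - 0.001 * demo$alcohol + rnorm(n, sd = 0.001)
demo$quality <- pmin(9, pmax(3, round(2 + 0.35 * demo$alcohol + rnorm(n))))

predictors <- setdiff(names(demo), "quality")

# Pearson treats quality as numeric; Spearman uses ranks only, which may
# suit an ordinal score better.
pearson  <- sapply(demo[predictors], cor, y = demo$quality)
spearman <- sapply(demo[predictors], cor, y = demo$quality,
                   method = "spearman")

print(round(rbind(pearson, spearman), 3))
```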

Let’s take a look at boxplots involving quality.

# Create a box plot for each variable, grouped by quality score.
# factor(quality) gives one box per score; a continuous x draws a single box.
qp1 <- qplot(x = factor(quality), y = fixed.acidity, data = ww, 
             geom = 'boxplot')
qp2 <- qplot(x = factor(quality), y = volatile.acidity, data = ww, 
             geom = 'boxplot')
qp3 <- qplot(x = factor(quality), y = citric.acid, data = ww, 
             geom = 'boxplot')
qp4 <- qplot(x = factor(quality), y = residual.sugar, data = ww, 
             geom = 'boxplot')
qp5 <- qplot(x = factor(quality), y = chlorides, data = ww, 
             geom = 'boxplot')
qp6 <- qplot(x = factor(quality), y = free.sulfur.dioxide, data = ww, 
             geom = 'boxplot')
qp7 <- qplot(x = factor(quality), y = total.sulfur.dioxide, data = ww, 
             geom = 'boxplot')
qp8 <- qplot(x = factor(quality), y = density, data = ww, 
             geom = 'boxplot')
qp9 <- qplot(x = factor(quality), y = pH, data = ww, 
             geom = 'boxplot')
qp10 <- qplot(x = factor(quality), y = sulphates, data = ww, 
              geom = 'boxplot')
qp11 <- qplot(x = factor(quality), y = alcohol, data = ww, 
              geom = 'boxplot')
grid.arrange(qp1,qp2,qp3,qp4,qp5,qp6,qp7,qp8,qp9,qp10,qp11)

# Create a box plot for the variables with the highest correlation

grid.arrange(qp8,qp11)

#Group the data by quality and then summarize by density
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:memisc':
## 
##     collect, recode, rename
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
quality.groups <- group_by(ww, quality)
winesByQuality <- summarize(quality.groups, mean_density = mean(density),
                               median_density = median(as.numeric(density)),
                               min_density = min(density),
                               max_density = max(density),
                               n = n())
winesByQuality
## # A tibble: 7 x 6
##   quality mean_density median_density min_density max_density     n
##     <int>        <dbl>          <dbl>       <dbl>       <dbl> <int>
## 1       3    0.9948840       0.994425     0.99110     1.00010    20
## 2       4    0.9942767       0.994100     0.98920     1.00040   163
## 3       5    0.9952626       0.995300     0.98722     1.00241  1457
## 4       6    0.9939613       0.993660     0.98758     1.03898  2198
## 5       7    0.9924524       0.991760     0.98711     1.00040   880
## 6       8    0.9922359       0.991640     0.98713     1.00060   175
## 7       9    0.9914600       0.990300     0.98965     0.99700     5

The median values show that density generally decreases as quality increases, although quality 5 is an exception.
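
The trend in the table can be checked by differencing the medians from one score to the next. The sketch below copies the `median_density` column from the output above; `step_change` is a name introduced for this example.

```r
# Median density per quality score, copied from the winesByQuality table.
median_density <- c("3" = 0.994425, "4" = 0.994100, "5" = 0.995300,
                    "6" = 0.993660, "7" = 0.991760, "8" = 0.991640,
                    "9" = 0.990300)

# Change in median density from each score to the next one up.
step_change <- diff(median_density)
print(step_change)

# Scores whose median is higher than the previous score's median.
print(names(step_change)[step_change > 0])
```

Only one step is positive, so the decline is not strictly monotonic. The groups at the extremes are also small (20 wines at score 3 and 5 at score 9), so their medians are less reliable.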

In addition to evaluating the correlations related to quality, I also want to probe how other variables work with each other. Here are the correlations of note that do not involve quality:

- Total sulfur dioxide and residual sugar: 0.401
- Total sulfur dioxide and free sulfur dioxide: 0.616
- Total sulfur dioxide and alcohol: -0.449
- Density and residual sugar: 0.839
- Alcohol and density: -0.780
- Residual sugar and alcohol: -0.451
- Fixed acidity and pH: -0.426

Density, alcohol, and residual sugar all appear to be strongly correlated to each other, so I am going to take a closer look at those plots.

dn1 <- ggplot(aes(x = density, y = residual.sugar), data = ww) +
  geom_point(alpha = 1/5) +
  xlim(quantile(ww$density, 0.01),
       quantile(ww$density, 0.99)) +
  ylim(quantile(ww$residual.sugar, 0.01),
       quantile(ww$residual.sugar, 0.99))
ac1 <- ggplot(aes(x = alcohol, y = density), data = ww) +
  geom_jitter(alpha = 1/5) + 
  xlim(quantile(ww$alcohol, 0.01),
       quantile(ww$alcohol, 0.99)) +
  ylim(quantile(ww$density, 0.01),
       quantile(ww$density, 0.99))
sg1 <- ggplot(aes(x = residual.sugar, y = alcohol), data = ww) +
  geom_point(alpha = 1/5) + 
  xlim(quantile(ww$residual.sugar, 0.01),
       quantile(ww$residual.sugar, 0.99)) +
  ylim(quantile(ww$alcohol, 0.01),
       quantile(ww$alcohol, 0.99))
grid.arrange(dn1, ac1, sg1)
## Warning: Removed 160 rows containing missing values (geom_point).
## Warning: Removed 201 rows containing missing values (geom_point).
## Warning: Removed 157 rows containing missing values (geom_point).

The correlations are very evident in the charts shown above. Sugar must be more dense than other ingredients in the wine, because higher density levels imply higher sugar quanity. Similarly, alcohol seems to imply lesser density. Lastly, alcohol and sugar may offset each other during the wine-making process, because lower levels of alcohol tend to have higher levels of sugar (and vice versa)

I also want to make a special note about pH levels and acidity. Of the acidity variables, fixed acidity has the most notable correlation with pH (-0.426). The negative sign is logical, as a higher pH value corresponds to lower acidity.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I evaluated every variable against our main feature, quality, and observed that alcohol content has the strongest relationship with it, although the correlation is only moderate (0.436). Density may also have a slight influence on quality.

At low alcohol levels, quality decreases as alcohol content increases; at higher levels, quality increases with alcohol content. The relationship is therefore not linear, as the smoothing line shows.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I discovered strong correlations between alcohol, residual sugar, and density. As alcohol content increases, density tends to decrease roughly linearly. As residual sugar increases, density also increases, and a linear model fits this well. Finally, as the residual sugar level rises, the alcohol level decreases. These relationships are consistent with the literature available online; I mainly referred to the literature provided by Waterhouse.

What was the strongest relationship you found?

The strongest correlation was between density and residual sugar (0.839).
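For a linear model with a single predictor, R-squared equals the squared Pearson correlation, so a correlation of 0.839 corresponds to an R-squared of roughly 0.70. The sketch below checks that identity on synthetic data; the variable names and coefficients are assumptions chosen only for illustration.

```r
# Synthetic stand-in data (assumption): density rises linearly with sugar.
set.seed(7)
n <- 500
sugar <- runif(n, 1, 20)
wine_density <- 0.992 + 0.0004 * sugar + rnorm(n, sd = 0.001)

# Simple linear regression with one predictor.
fit <- lm(wine_density ~ sugar)

r <- cor(sugar, wine_density)
r_squared <- summary(fit)$r.squared

# The last two values should agree up to floating-point error.
c(r = r, r_squared = r_squared, r_times_r = r^2)
```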

Multivariate Plots Section

ww$quality.cat <- factor(ww$quality)

ggplot(aes(x = alcohol, y = density, color = quality.cat), data = ww) + 
  geom_point(size = 1, position = 'jitter') +
  scale_color_brewer(type = 'seq',
                     guide = guide_legend(title = 'Quality', reverse = TRUE,
                                          override.aes = list(alpha = 1, size = 2))) + 
  scale_x_continuous(limits = c(8, 14.5), breaks = seq(8, 14.5, .5)) + 
  scale_y_continuous(limits = c(.985, 1.015), breaks = seq(.985, 1.015, .005))
## Warning: Removed 3 rows containing missing values (geom_point).

The points generally get darker toward the right of the graph, meaning that higher-quality wines cluster at higher alcohol levels. The correlations between alcohol and quality and between density and quality are both evident.

ww$free.sulfur.dioxide.cat <- ifelse(ww$free.sulfur.dioxide <= 50, '<= 50mg/l, not noticeable', '> 50mg/l, noticeable')
ww$free.sulfur.dioxide.cat <- factor(ww$free.sulfur.dioxide.cat)

ggplot(ww, aes(quality.cat, alcohol, fill = free.sulfur.dioxide.cat)) +
    geom_jitter(alpha = 0.1) + 
    geom_boxplot() 

Given the same quality, wine without a sulfur aroma is more likely to have a higher alcohol level. For instance, among wines with a quality score of 6, those without a sulfur smell have a median alcohol by volume of 10.6%, compared with 9.6% for those with an evident sulfur smell (the blue boxplots). This suggests that you are more likely to get a better-quality wine if the sulfur level is unnoticeable.
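Medians like the ones quoted above can be read off the boxplot, but they can also be computed directly. The following sketch uses base R's `aggregate()` on synthetic stand-in data; the column `so2.cat` and the toy effect sizes are assumptions, not values from the wine dataset.

```r
# Synthetic stand-in data (assumption): three quality levels, 40 wines each.
set.seed(3)
toy <- data.frame(
  quality = rep(c(5, 6, 7), each = 40),
  free.sulfur.dioxide = runif(120, 5, 90)
)
# Toy construction: alcohol is about 1 point lower when free SO2 exceeds 50 mg/l.
toy$alcohol <- 9.5 + 0.5 * (toy$quality - 5) -
  1.0 * (toy$free.sulfur.dioxide > 50) + rnorm(120, sd = 0.3)
toy$so2.cat <- ifelse(toy$free.sulfur.dioxide <= 50, "<= 50 mg/l", "> 50 mg/l")

# Median alcohol for each combination of quality and sulfur category.
medians <- aggregate(alcohol ~ quality + so2.cat, data = toy, FUN = median)
medians
```

On the real data, the equivalent call would group `alcohol` by `quality.cat` and `free.sulfur.dioxide.cat` in `ww`.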

ggplot(ww,aes(quality.cat, alcohol)) + 
  geom_boxplot(aes(fill= free.sulfur.dioxide.cat), alpha = 0.5) + 
  theme(legend.position=c(1,1),legend.justification=c(1,1)) + 
  xlab('Wine Quality') + 
  ylab('Alcohol (% by volume)')

ggplot(ww, aes(quality, fill= free.sulfur.dioxide.cat)) + 
  geom_density(alpha=.5) + 
  theme(legend.position = "none") + 
  xlab('Wine Quality')

m1 <- lm(quality ~ alcohol, data = ww)
m2 <- update(m1, ~ . + density)
m3 <- update(m2, ~ . + chlorides)
m4 <- update(m3, ~ . + fixed.acidity)
m5 <- update(m4, ~ . + volatile.acidity)
m6 <- update(m5, ~ . + pH)
m7 <- update(m6, ~ . + total.sulfur.dioxide)
m8 <- update(m7, ~ . + log(residual.sugar))
m9 <- update(m8, ~ . + citric.acid)
m10 <- update(m9, ~ . + free.sulfur.dioxide)
m11 <- update(m10, ~ . + sulphates)
mtable(m1, m2, m3, m4, m5, m6, m7, m8, m9, m10, m11)
## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = ww)
## m2: lm(formula = quality ~ alcohol + density, data = ww)
## m3: lm(formula = quality ~ alcohol + density + chlorides, data = ww)
## m4: lm(formula = quality ~ alcohol + density + chlorides + fixed.acidity, 
##     data = ww)
## m5: lm(formula = quality ~ alcohol + density + chlorides + fixed.acidity + 
##     volatile.acidity, data = ww)
## m6: lm(formula = quality ~ alcohol + density + chlorides + fixed.acidity + 
##     volatile.acidity + pH, data = ww)
## m7: lm(formula = quality ~ alcohol + density + chlorides + fixed.acidity + 
##     volatile.acidity + pH + total.sulfur.dioxide, data = ww)
## m8: lm(formula = quality ~ alcohol + density + chlorides + fixed.acidity + 
##     volatile.acidity + pH + total.sulfur.dioxide + log(residual.sugar), 
##     data = ww)
## m9: lm(formula = quality ~ alcohol + density + chlorides + fixed.acidity + 
##     volatile.acidity + pH + total.sulfur.dioxide + log(residual.sugar) + 
##     citric.acid, data = ww)
## m10: lm(formula = quality ~ alcohol + density + chlorides + fixed.acidity + 
##     volatile.acidity + pH + total.sulfur.dioxide + log(residual.sugar) + 
##     citric.acid + free.sulfur.dioxide, data = ww)
## m11: lm(formula = quality ~ alcohol + density + chlorides + fixed.acidity + 
##     volatile.acidity + pH + total.sulfur.dioxide + log(residual.sugar) + 
##     citric.acid + free.sulfur.dioxide + sulphates, data = ww)
## 
## ==================================================================================================================================================================================
##                              m1            m2            m3            m4            m5            m6            m7            m8            m9           m10           m11       
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
##   (Intercept)               2.582***    -22.492***    -21.150***    -31.387***    -47.652***    -47.870***    -43.543***     41.731***     42.639***     37.700***     47.757***  
##                            (0.098)       (6.165)       (6.162)       (6.355)       (6.195)       (6.222)       (6.510)      (11.223)      (11.284)      (11.294)      (11.437)    
##   alcohol                   0.313***      0.360***      0.343***      0.356***      0.405***      0.406***      0.408***      0.310***      0.308***      0.310***      0.296***  
##                            (0.009)       (0.015)       (0.015)       (0.015)       (0.015)       (0.015)       (0.015)       (0.018)       (0.019)       (0.019)       (0.019)    
##   density                                24.728***     23.671***     34.437***     50.909***     51.237***     46.805***    -39.975***    -40.902***    -36.049**     -46.226***  
##                                          (6.079)       (6.074)       (6.293)       (6.137)       (6.199)       (6.501)      (11.351)      (11.414)      (11.422)      (11.567)    
##   chlorides                                            -2.382***     -2.421***     -1.323*       -1.334*       -1.399**      -0.762        -0.808        -0.831        -0.818     
##                                                        (0.558)       (0.555)       (0.539)       (0.540)       (0.541)       (0.541)       (0.544)       (0.542)       (0.541)    
##   fixed.acidity                                                      -0.087***     -0.101***     -0.103***     -0.103***     -0.027        -0.029        -0.020        -0.014     
##                                                                      (0.014)       (0.014)       (0.015)       (0.015)       (0.017)       (0.017)       (0.017)       (0.017)    
##   volatile.acidity                                                                 -2.085***     -2.088***     -2.112***     -2.117***     -2.101***     -1.981***     -1.953***  
##                                                                                    (0.110)       (0.111)       (0.111)       (0.110)       (0.112)       (0.114)       (0.114)    
##   pH                                                                                             -0.031        -0.042         0.326***      0.332***      0.343***      0.317***  
##                                                                                                  (0.081)       (0.081)       (0.090)       (0.090)       (0.090)       (0.090)    
##   total.sulfur.dioxide                                                                                          0.001*        0.000         0.000        -0.001        -0.001*    
##                                                                                                                (0.000)       (0.000)       (0.000)       (0.000)       (0.000)    
##   log(residual.sugar)                                                                                                         0.225***      0.226***      0.210***      0.232***  
##                                                                                                                              (0.024)       (0.024)       (0.024)       (0.025)    
##   citric.acid                                                                                                                               0.075         0.057         0.037     
##                                                                                                                                            (0.097)       (0.096)       (0.096)    
##   free.sulfur.dioxide                                                                                                                                     0.004***      0.004***  
##                                                                                                                                                          (0.001)       (0.001)    
##   sulphates                                                                                                                                                             0.502***  
##                                                                                                                                                                        (0.099)    
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
##   R-squared                 0.190         0.192         0.195         0.202         0.256         0.256         0.257         0.270         0.270         0.274         0.278     
##   adj. R-squared            0.190         0.192         0.195         0.201         0.255         0.255         0.256         0.269         0.269         0.272         0.276     
##   sigma                     0.797         0.796         0.795         0.792         0.764         0.764         0.764         0.757         0.757         0.755         0.754     
##   F                      1146.395       583.290       396.315       309.222       336.912       280.734       241.554       225.827       200.787       184.336       170.797     
##   p                         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood        -5839.391     -5831.127     -5822.011     -5802.684     -5629.932     -5629.861     -5627.322     -5584.491     -5584.187     -5570.814     -5557.825     
##   Deviance               3112.257      3101.773      3090.247      3065.956      2857.136      2857.053      2854.093      2804.611      2804.262      2788.992      2774.238     
##   AIC                   11684.782     11670.255     11654.021     11617.368     11273.865     11275.722     11272.645     11188.982     11190.373     11165.629     11141.649     
##   BIC                   11704.272     11696.241     11686.504     11656.348     11319.341     11327.694     11331.114     11253.948     11261.836     11243.588     11226.105     
##   N                      4898          4898          4898          4898          4898          4898          4898          4898          4898          4898          4898         
## ==================================================================================================================================================================================

No combination of variables gave a good model for predicting the quality score. The R-squared value is very low even after including all variables: the full model (m11) explains only about 28% of the variance in quality (R-squared = 0.278).
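R-squared never decreases when a predictor is added, so it helps to look at adjusted R-squared and AIC as well. In the table above, for example, adding pH (m5 to m6) leaves R-squared at 0.256 while AIC rises from 11273.865 to 11275.722. The sketch below shows one way to compare two nested models; `fit_stats` is a hypothetical helper and the data are synthetic, with `noise_var` deliberately unrelated to the response.

```r
# Synthetic stand-in data (assumption): quality depends on alcohol only.
set.seed(11)
n <- 300
toy <- data.frame(alcohol = runif(n, 8, 14), noise_var = rnorm(n))
toy$quality <- 3 + 0.3 * toy$alcohol + rnorm(n, sd = 0.8)

m_small <- lm(quality ~ alcohol, data = toy)
m_big   <- update(m_small, ~ . + noise_var)

# Hypothetical helper: collect a few fit statistics from an lm object.
fit_stats <- function(model) {
  s <- summary(model)
  c(r.squared = s$r.squared, adj.r.squared = s$adj.r.squared, AIC = AIC(model))
}

comparison <- rbind(m_small = fit_stats(m_small), m_big = fit_stats(m_big))
round(comparison, 3)

# F-test for whether the extra predictor improves the fit.
anova(m_small, m_big)
```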

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

In this section, I tried to visualise some of the variables more concisely and precisely. Some of the insights into relationships between alcohol, density and residual sugars were strengthened.

Were there any interesting or surprising interactions between features?

It is interesting to note that the chemical property trends for wines of quality 5 and below are almost the inverse of the trends for wines of quality 6 and above. This might be due to the influence of an unknown variable that is not in the dataset. Alternatively, there might be something that I have missed. The use of artificial flavouring and other chemical agents might give low-quality wines the same chemical properties but different tastes.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I tried to fit a linear model to the dataset to predict the quality of white wine from the features provided.

The model grew slightly stronger as I added more features. However, a linear model may not be the best way to represent this data: the R-squared values stayed low and the residuals stayed high. Using all the features provided (R-squared = 0.278) is not very different from using only alcohol as a predictor (R-squared = 0.190), which was tried in the bivariate section. This might be because some of the features are correlated with each other.

To improve the model, we might need to introduce new features or find a new way to transform the data. Moreover, there might be a better method than linear regression for predicting quality.
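Correlated predictors can be checked with variance inflation factors (VIFs). The table above hints at the problem: the density coefficient flips from 46.805 (m7) to -39.975 (m8) once log(residual.sugar) enters the model. The sketch below computes VIFs in base R on synthetic data; `vif_base` is a hypothetical helper written for illustration, not a function from any package used in this report.

```r
# Hypothetical helper: variance inflation factor for each predictor,
# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j
# on all the other predictors.
vif_base <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    fit <- lm(reformulate(others, response = v), data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Synthetic stand-in data (assumption): density is driven by sugar and alcohol,
# while pH is unrelated to the other predictors.
set.seed(5)
n <- 400
sugar <- runif(n, 1, 20)
alcohol <- runif(n, 8, 14)
toy <- data.frame(
  residual.sugar = sugar,
  alcohol = alcohol,
  density = 1.0 + 0.0004 * sugar - 0.001 * alcohol + rnorm(n, sd = 0.0005),
  pH = rnorm(n, 3.2, 0.15)
)
vifs <- vif_base(toy)
round(vifs, 2)
```

A common rule of thumb treats VIFs above 5 or 10 as a sign that a predictor is largely explained by the others; in this toy data, density is built to be such a case.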

Final Plots and Summary

saq <- ggplot(aes(x = alcohol, y = quality), data = ww) +
  geom_jitter(alpha = 1/10) +
  xlab("Alcohol level (% by volume)") +
  ylab("Quality score (0 to 10)") +
  ggtitle("Scatterplot") +
  scale_x_continuous(breaks = seq(8,14,1))
# factor(quality) makes the x axis discrete, so one box is drawn per quality score
bqa <- qplot(x = factor(quality), y = alcohol,
             data = ww,
             geom = 'boxplot') +
  xlab("Quality score (0 to 10)") +
  ylab("Alcohol level (% by volume)") +
  ggtitle("Boxplot") +
  scale_y_continuous(breaks = seq(8,14,1))
grid.arrange(saq,bqa)

The strongest correlation between the feature of interest, quality, and any other feature was with alcohol, at 0.436. This relationship can be visualised using the charts above. In the scatterplot, the concentration of points shifts upward from left to right, meaning that as alcohol level increases, quality also tends to increase. A closer look at the box plot shows that the increasing trend is not steady: between quality 3 and 5 the relationship is negative. I would also speculate that beyond about 12.5% alcohol the quality of wine decreases because the alcohol taste overpowers the native wine taste, although these plots do not establish this.

ggplot(aes(x = density, y = alcohol, color = as.factor(quality)), data = ww) +
  geom_point() + 
  scale_color_brewer(type = 'qual') +
  xlim(quantile(ww$density, 0.01),
       quantile(ww$density, 0.99)) +
  xlab("Density (g / cm^3)") +
  ylab("Alcohol (% by volume)") +
  ggtitle("Alcohol vs. Density with Color Set by Quality")
## Warning: Removed 98 rows containing missing values (geom_point).

This is a good visualisation of the relationship between alcohol, density, and quality. I trimmed the top and bottom 1% of density values to remove outliers and make the visualisation clearer. However, for some reason I was not able to color the visualisation as intended.

Alcohol and density have a negative relationship: as alcohol content increases, density decreases. Also, the better-quality wines are concentrated at the top left of the graph. The points disperse in the middle and converge at the bottom right. This also hints that as density increases, the quality of wine tends to decrease.

Reflection

The White Wines dataset contains information on 4,898 samples of Portuguese white wine (Vinho Verde) across 11 chemical properties and a special feature called quality score, which was evaluated by wine experts. I started by exploring individual variables in the dataset and went on to investigate the relationship between each chemical property and quality, which was chosen as the main feature in my analysis. Eventually, I tried to create a linear model to predict the quality of wine from the other chemical properties.

There was a trend between quality and alcohol, but the other variables did not show a strong correlation with quality. However, several of the variables were fairly strongly correlated with each other. This might also be the reason why I was not able to come up with a linear model that predicts the quality score well. Transformations might have helped, but I could not identify a direction to pursue. Alternatively, the absence of other features in the dataset might also be a reason why I wasn't able to produce a good linear model in my analysis.

Some limitations of this data include missing features such as glycerol, tannin, amino acids, and minerals. Another limitation is that the quality score is a very subjective indicator. A more robust database could have produced a better model.

Having said that, this is my first project in R. I have so much to learn, and I am sure that as the course progresses I will be able to deliver better work.